Abstract: Let P = {p(i)} be a measure of strictly positive

probabilities on the set of nonnegative integers. Although the 
countable number of inputs prevents usage of the Huffman 
algorithm, there are nontrivial P for which known methods 
find a source code that is optimal in the sense of minimizing 
expected codeword length. For some applications, however, a 
source code should instead minimize one of a family of nonlinear 
objective functions, β-exponential means, those of the form
$\log_a \sum_i p(i)\, a^{n(i)}$, where n(i) is the length of the ith codeword
and a is a positive constant. Applications of such minimizations 
include a novel problem of maximizing the chance of message 
receipt in single-shot communications (a < 1) and a previously 
known problem of minimizing the chance of buffer overflow 
in a queueing system (a > 1). This paper introduces methods 
for finding codes optimal for such exponential means. One 
method applies to geometric distributions, while another applies 
to distributions with lighter tails. The latter algorithm is applied 
to Poisson distributions and both are extended to alphabetic 
codes, as well as to minimizing maximum pointwise redundancy. 
The aforementioned application of minimizing the chance of 
buffer overflow is also considered. 

Index Terms — Communication networks, generalized en- 
tropies, generalized means, Golomb codes, Huffman algorithm, 
optimal prefix codes, queueing, worst case minimax redundancy. 



I. Introduction, Motivation, and Main Results 

If probabilities are known, optimal lossless source coding 
of individual symbols (and blocks of symbols) is usually 
done using David Huffman's famous algorithm [1]. There 
are, however, cases that this algorithm does not solve. Prob- 
lems with an infinite number of possible inputs — e.g., 
geometrically-distributed variables — are not covered. Also, 
in some instances, the optimality criterion — or penalty — 
is not the linear penalty of expected length. Both variants of 
the problem have been considered in the literature, but not 
simultaneously. This paper discusses cases which are both 
infinite and nonlinear. 

An infinite-alphabet source emits symbols drawn from the
alphabet $\mathcal{X}_\infty = \{0, 1, 2, \ldots\}$. (More generally, we use $\mathcal{X}$ to
denote an input alphabet whether infinite or finite.) Let P =
{p(i)} be the sequence of probabilities for each symbol, so that
the probability of symbol i is p(i) > 0. The source symbols are
coded into binary codewords. The codeword $c(i) \in \{0, 1\}^*$ in
code C, corresponding to input symbol i, has length n(i), thus
defining length distribution N. Such codes are called integer
codes (as in, e.g., [2]).

Perhaps the most well-known integer codes are the codes 
derived by Golomb for geometric distributions [3], [4], and 
many other types of integer codes have been considered by 
others [5]. There are many reasons for using such integer codes 
rather than codes for finite alphabets, such as Huffman codes. 
The most obvious use is for cases with no upper bound — or 
at least no known upper bound — on the number of possible 
items. In addition, for many cases it is far easier to come up 
with a general code for integers rather than a Huffman code for 
a large but finite number of inputs. Similarly, it is often faster 
to encode and decode using such well-structured codes. For 
these reasons, integer codes and variants of them are widely 
used in image and video compression standards [6], [7], as 
well as for compressing text, audio, and numerical data. 

To date, the literature on integer codes has considered
only finding efficient uniquely decipherable codes with respect
to minimizing expected codeword length $\sum_i p(i)\, n(i)$. Other
utility functions, however, have been considered for finite-
alphabet codes. Campbell [8] introduced a problem in which
the penalty to minimize, given some continuous (strictly)
monotonic increasing cost function $\varphi(x) : \mathbb{R}_+ \to \mathbb{R}_+$, is

$$L(P, N, \varphi) = \varphi^{-1}\left( \sum_i p(i)\, \varphi(n(i)) \right)$$

and specifically considered the exponential subcases with
exponent a > 1:

$$L_a(P, N) \triangleq \log_a \sum_i p(i)\, a^{n(i)} \tag{1}$$

that is, $\varphi(x) = a^x$. Note that minimizing penalty L is also an
interesting problem for 0 < a < 1 and approaches the standard
penalty $\sum_i p(i)\, n(i)$ for $a \to 1$ [8]. While $\varphi(x)$ decreases
for a < 1, one can map decreasing $\varphi$ to a corresponding
increasing function $\tilde{\varphi}(l) = \varphi_{\max} - \varphi(l)$ (e.g., for $\varphi_{\max} = 1$)
without changing the penalty value. Thus this problem,
equivalent to maximizing $\sum_i p(i)\, a^{n(i)}$, is a subset of those
considered by Campbell. All penalties of the form (1) are
called β-exponential means, where $\beta = \log_2 a$ [9, p. 158].
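As an illustration of penalty (1), the following Python sketch evaluates $L_a(P, N)$ for a small finite example and compares it with expected length as a approaches 1. The function names and the four-symbol distribution are hypothetical choices made for this example; they do not appear in the original text.

```python
import math

def exponential_penalty(probs, lengths, a):
    """Return L_a(P, N) = log_a sum_i p(i) * a**n(i) for a > 0, a != 1."""
    if a <= 0 or a == 1:
        raise ValueError("a must be positive and different from 1")
    total = sum(p * a ** n for p, n in zip(probs, lengths))
    return math.log(total, a)

def expected_length(probs, lengths):
    """Linear penalty sum_i p(i) * n(i), the a -> 1 limit of L_a."""
    return sum(p * n for p, n in zip(probs, lengths))

# Hypothetical example: a dyadic distribution with a prefix-code length vector.
probs = [0.5, 0.25, 0.125, 0.125]
lengths = [1, 2, 3, 3]
for a in (0.5, 0.999, 1.001, 2.0):
    print(a, exponential_penalty(probs, lengths, a))
print("linear:", expected_length(probs, lengths))
```

For this example the penalty lies below the expected length when a < 1 and above it when a > 1, with both sides converging to the expected length near a = 1.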

Campbell noted certain properties for β-exponential means,
but did not consider applications for these means. Applications
were later found for the problem with a > 1 [10]-[12]; these
applications all relate to a buffer overflow problem discussed
in Section V.

Here we introduce a novel application for problems of the
form a < 1. Consider a situation related by Alfréd Rényi,
an ancient scenario in which a rebel fortress was besieged by 
Romans. The rebels' only hope was the knowledge gathered 
by a mute, illiterate spy, one who could only nod and shake 
his head [13, pp. 13-14]. This apocryphal tale — based upon a 
historical siege — is the premise behind the Hungarian version 
of the spoken parlor game Twenty Questions. A modern 
parallel in the 21st century occurred when Russian forces
gained the knowledge needed to defeat hostage-takers by 
asking hostages "yes" or "no" questions over mobile phones 
[14], [15]. 

Rényi presented this problem in narrative form in order
to motivate the relation between Shannon entropy and binary 
prefix coding. Note however that Twenty Questions, traditional 
prefix coding, and the siege scenario actually have three 
different objectives. In Twenty Questions, the goal is to be 
able to determine the symbol (i.e., the item or message) 
by asking at most twenty questions. In prefix coding, the 
goal is to minimize the expected number of questions — 
or, equivalently, bits — necessary to determine the message. 
For the siege scenario, the goal is survival; that is, assuming 
partial information is not useful, the besieged would wish 
to maximize the probability that the message is successfully 
transmitted within a certain window of opportunity. When 
this window closes — e.g., when the fortress falls — the 
information becomes worthless. An analogous situation occurs 
when a wireless device is losing power or is temporarily within 
range of a base station; one can safely assume that the channel, 
when available, will transmit at the lowest (constant) bitrate, 
and will be lost after a nondeterministic time period. 

Assume that the duration of the window of opportunity is 
independent of the communicated message and is memoryless, 
the latter being a common assumption — due to both its 
accuracy and expedience — of such stochastic phenomena. 
Memorylessness implies that the window duration is dis- 
tributed exponentially. Therefore, quantizing time in terms of 
the number of bits T that we can send within our window, 

$$P(T = t) = (1 - a)\, a^t, \quad t = 0, 1, 2, \ldots$$

with known positive parameter a < 1. We then wish to 
maximize the probability of success, i.e., the probability that 
the message length does not exceed the quantized window 
length: 

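$$\begin{aligned}
P[n(X) \leq T] &= \sum_{t=0}^{\infty} P(T = t) \cdot P[n(X) \leq t] \\
&= \sum_{t=0}^{\infty} (1 - a)\, a^t \cdot \sum_{i \in \mathcal{X}} p(i)\, 1_{n(i) \leq t} \\
&= \sum_{i \in \mathcal{X}} p(i) \cdot (1 - a) \sum_{t=n(i)}^{\infty} a^t \\
&= \sum_{i \in \mathcal{X}} p(i)\, a^{n(i)} \cdot (1 - a) \sum_{t=0}^{\infty} a^t \\
&= \sum_{i \in \mathcal{X}} p(i)\, a^{n(i)}
\end{aligned}$$

where $1_{n(i) \leq t}$ is 1 if $n(i) \leq t$, 0 otherwise. Minimizing (1) is
an equivalent objective.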
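The identity above can be checked numerically. The sketch below, with hypothetical function names and a truncation point `t_max` chosen only for illustration, sums $P(T = t) \cdot P[n(X) \leq t]$ directly and compares the result with the closed form $\sum_i p(i)\, a^{n(i)}$.

```python
def success_probability_direct(probs, lengths, a, t_max=2000):
    """Sum P(T = t) * P[n(X) <= t] over t = 0..t_max (geometric tail truncated)."""
    total = 0.0
    for t in range(t_max + 1):
        p_window = (1 - a) * a ** t                      # P(T = t)
        p_fits = sum(p for p, n in zip(probs, lengths) if n <= t)
        total += p_window * p_fits
    return total

def success_probability_closed(probs, lengths, a):
    """Closed form sum_i p(i) * a**n(i) derived in the text."""
    return sum(p * a ** n for p, n in zip(probs, lengths))

# Hypothetical example values.
probs = [0.5, 0.25, 0.125, 0.125]
lengths = [1, 2, 3, 3]
print(success_probability_direct(probs, lengths, 0.9))
print(success_probability_closed(probs, lengths, 0.9))
```

The two values agree up to the truncation error, which is negligible for the chosen `t_max`.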

Note that this problem can be constrained or otherwise 
modified for the application in question. For example, in some 
cases, we might need some extra time to send the first bit, or, 
alternatively, the window of opportunity might be of at least 
a certain duration, increasing or reducing the probability that 
no bits can be sent, respectively. Thus we might have 

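$$P(T = t) = \begin{cases} t_0, & t = 0 \\ (1 - t_0)(1 - a)\, a^{t-1}, & t = 1, 2, \ldots \end{cases}$$

for some $t_0 \in (0, 1)$. In this case,

$$P[n(X) \leq T] = \frac{1 - t_0}{a} \sum_{i \in \mathcal{X}} p(i)\, a^{n(i)}$$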

and the maximizing code is identical to that of the more 
straightforward case. Likewise, if we need to send multiple 
messages, the same code maximizes the expected number of 
independent messages we can send within the window, due to 
the memoryless property. 

We must be careful regarding the meaning of an "optimal 
code" when there are an infinite number of possible codes 
under consideration. One might ask whether there must exist 
an optimal code or if there can be an infinite sequence of codes 
of decreasing penalty without any code achieving the limit 
penalty value. Fortunately the answer is the former, the proof 
being a special case of Theorem 2 in [16] (a generalization of 
the result for the expected-length penalty [17]). The question 
is then how to find one of these optimal source codes given 
parameter a and probability measure P. 

As in the linear case, a general solution for (1) is not
known for general P over a countably infinite number of
events, but methods and properties for finite numbers of events,
discussed in the next section, can be used to find
optimal codes for certain common infinite-item distributions.
In Section III we consider geometric distributions and find that
Golomb codes are optimal, although the optimal Golomb code
for a given probability mass function varies according to a.
The main result of Section III is that, for $p_\theta(i) = (1 - \theta)\theta^i$ and
$a \in \mathbb{R}_+$, $G_k$, the Golomb code with parameter k, is optimal,
where

$$k = \max\left(1, \left\lceil -\log_\theta a - \log_\theta(1 + \theta) \right\rceil\right).$$

In Section IV we consider distributions that are relatively
light-tailed, that is, that decline faster than certain geometric
distributions. If there is a nonnegative integer r such that for
all j > r and i < j,

$$p(i) \geq \max\left( p(j),\ \sum_{k=j+1}^{\infty} p(k)\, a^{k-j} \right),$$

then an optimal binary prefix code tree exists which consists
of a unary code tree appended to a leaf of a finite code tree.
A specific case of this is the Poisson distribution, $p_\lambda(i) = \lambda^i e^{-\lambda} / i!$,
where e is the base of the natural logarithm ($e \approx 2.71828$).
We show that in this case the aforementioned r is
given by $r = \max(\lceil 2a\lambda \rceil - 2, \lceil e\lambda \rceil - 1)$. An application, that
of minimizing probability of buffer overflow, as in [11], is
considered in Section V, where we show that the algorithm
developed in [11] readily extends to coding geometric and
light-tailed distributions. Section VI discusses the maximum
pointwise redundancy penalty, which has a similar solution
for light-tailed distributions and for which the Golomb code
$G_k$ with $k = \lceil -1/\log_2 \theta \rceil$ is optimal for geometric
distributions. We conclude with some remarks on possible
extensions to this work.

Throughout the following, a set or sequence of items x(i) 
is represented by its uppercase counterpart, X. A glossary of 
terms is given in the appendix.

II. Background: Finite Alphabets 

If a finite number of events comprise P (i.e., $|\mathcal{X}| < \infty$), the
exponential penalty (1) is minimized using an algorithm found
independently by Hu et al. [18, p. 254], Parker [19, p. 485],
and Humblet [20, p. 25], [11, p. 231], although only the last
of these considered a < 1. (The simultaneity of these lines of
research was likely due to the appearance of the first paper
on adapting the Huffman algorithm to a nonlinear penalty,
$\max_i (p(i) + n(i))$ for given $p(i) \in \mathbb{R}_+$, in 1976 [21].) We
will use this finite-alphabet exponential-penalty algorithm in
the sections that follow in order to prove optimality for infinite
distributions, so let us reproduce the algorithm here:

Procedure for Exponential Huffman Coding (finite alphabets):
This procedure finds the optimal code whether a >
1 (a minimization of the average of a growing exponential)
or a < 1 (a maximization of the average of a decaying exponential).
Note that it minimizes (1), even if the "probabilities"
do not add to 1. We refer to such arbitrary positive inputs as
weights, denoted by w(i) instead of p(i):

1) Each item i has weight $w(i) \in W_{\mathcal{X}}$, where $\mathcal{X}$ is the
(finite) alphabet and $W_{\mathcal{X}} = \{w(i)\}$ is the set of all such
weights. Assume each item i has codeword c(i), to be
determined later.

2) Combine the items with the two smallest weights w(j)
and w(k) into one compound item with the combined
weight $\tilde{w}(j) = a \cdot (w(j) + w(k))$. This item has codeword
$\tilde{c}(j)$, to be determined later, while item j is assigned
codeword $c(j) = \tilde{c}(j)0$ and k codeword $c(k) = \tilde{c}(j)1$.
Since these have been assigned in terms of $\tilde{c}(j)$, replace
w(j) and w(k) with $\tilde{w}(j)$ in $W_{\mathcal{X}}$ to form $W_{\tilde{\mathcal{X}}}$.

3) Repeat procedure, now with the remaining codewords
(reduced in number by 1) and corresponding weights,
until only one item is left. The weight of this item
is $\sum_i w(i)\, a^{n(i)}$. All codewords are now defined by
assigning the null string to this trivial item.

This algorithm assigns a weight to each node of the resulting 
implied code tree by having each item represented by a 
node with its parent representing the items combined into its 
subtree, as in Fig. 1. If a node is a leaf, its weight is given
by the associated probability; otherwise its weight is defined 
recursively as a times the sum of its children. This concept is 
useful in visualizing both the coding procedure and its output. 
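The merging procedure and the node-weight interpretation can be sketched in Python as follows. This is a minimal illustration using a binary heap rather than the authors' implementation; the function name and the example weights are hypothetical. It returns codeword lengths together with the root weight, which should equal $\sum_i w(i)\, a^{n(i)}$.

```python
import heapq
from itertools import count

def exponential_huffman_lengths(weights, a):
    """Codeword lengths from the exponential Huffman merging rule.

    Repeatedly merges the two smallest weights w(j), w(k) into
    a * (w(j) + w(k)). Returns (lengths, root_weight).
    """
    if len(weights) == 1:
        return [0], weights[0]
    tie = count()  # tie-breaker so the heap never compares the item lists
    heap = [(w, next(tie), [i]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    lengths = [0] * len(weights)
    while len(heap) > 1:
        w1, _, items1 = heapq.heappop(heap)
        w2, _, items2 = heapq.heappop(heap)
        for i in items1 + items2:
            lengths[i] += 1  # each leaf under the new node moves one level down
        heapq.heappush(heap, (a * (w1 + w2), next(tie), items1 + items2))
    return lengths, heap[0][0]

# Hypothetical example weights.
weights = [0.3, 0.3, 0.2, 0.2]
for a in (0.4, 1.0, 2.0):
    print(a, exponential_huffman_lengths(weights, a))
```

With a = 1 the rule reduces to ordinary Huffman coding and yields a balanced tree for these weights, while a = 0.4 yields the lengths of a truncated unary code.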

Van Leeuwen implemented the Huffman algorithm in linear 
time (to input size) given sorted weights in [22], and this 
implementation was extended to the exponential problem in 
[23] as follows: 



Two-Queue Implementation of Exponential Huffman 
Coding: The two-queue method of implementing the Huffman 
algorithm puts nodes/items in two queues, the first of which is 
initialized with the input items (eventual leaf nodes) arranged 
from head to tail in order of nondecreasing weight, and the 
second of which is initially empty. At any given step, a node 
with lowest weight among all nodes in both queues is at the 
head of one of the two queues, and thus two lowest-weighted 
nodes can be combined in constant time. This compound node 
is then inserted into (the tail of) the second queue, and the 
algorithm progresses until only one node is left. This node is 
the root of the coding tree and is obtained in linear time. 
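A minimal Python sketch of the two-queue method follows, assuming the input weights are already sorted in nondecreasing order. The function name is hypothetical, and only the root weight $\sum_i w(i)\, a^{n(i)}$ is tracked, to keep the example short.

```python
from collections import deque

def two_queue_root_weight(sorted_weights, a):
    """Two-queue exponential Huffman on weights in nondecreasing order.

    Returns the root weight sum_i w(i) * a**n(i) of the resulting tree.
    """
    leaves = deque(sorted_weights)   # first queue: input items
    compound = deque()               # second queue: merged nodes

    def pop_smallest():
        # The overall minimum is at the head of one of the two queues.
        if not compound or (leaves and leaves[0] <= compound[0]):
            return leaves.popleft()
        return compound.popleft()

    while len(leaves) + len(compound) > 1:
        w1 = pop_smallest()
        w2 = pop_smallest()
        compound.append(a * (w1 + w2))   # new node goes to the tail
    return (leaves or compound)[0]

# Hypothetical example weights, sorted in nondecreasing order.
print(two_queue_root_weight([0.2, 0.2, 0.3, 0.3], 0.4))
print(two_queue_root_weight([0.2, 0.2, 0.3, 0.3], 2.0))
```

Each merge costs constant time because only the two queue heads are inspected, which is the source of the linear running time.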

The presentation of the algorithm in [23] did not include a 
formal proof, so we find it useful to present one here: 

Lemma 1: The two-queue method using the exponential 
combining rule results in an optimal exponential Huffman code 
given a finite number of input items. 

Proof: The method is clearly a valid implementation of 
the exponential Huffman algorithm so long as both queues' 
sets of nodes remain in nondecreasing order. This is clearly 
satisfied prior to the first combination step. Here we show that, 
if nodes are in order at all points prior to a given combination 
step, they must be in order at the end of that step as well, in- 
ductively proving the correctness of the algorithm. It is obvious 
that order is preserved in the single-item queue, since nodes are 
only removed from it, not added to it. In the compound-node 
queue, order is only a concern if there is already at least one 
node in it at the beginning of this step, a step that combines 
nodes we call node $i_1$ and node $i_2$. If so, the item at the tail
of the compound-node queue at the beginning of the step was
two separate items, $i_3$ and $i_4$, at the beginning of the prior
step. At the beginning of this prior step, all four items must
have been distinct, i.e., corresponding to distinct sets of
(possibly combined) leaf nodes, and, because the algorithm
chooses the smallest two nodes to combine, neither $i_3$ nor
$i_4$ can have a greater weight than either $i_1$ or $i_2$. Thus,
since $a \cdot (w(i_3) + w(i_4)) \leq a \cdot (w(i_1) + w(i_2))$ and
the node with weight $a \cdot (w(i_3) + w(i_4))$ is the compound
node with the largest weight in the compound-node queue at
the beginning of the step in question, the queues remain
properly ordered at the end of the step in question. ■

If a < 0.5, the compound-node queue will never have more 
than one item. At each step after the first, the sole compound 
item will be removed from its queue since it has a weight less 
than the maximum weight of each of the two nodes combined 
to create it, which in turn is no greater than the weight of any 
node in the single-item queue. It is replaced by the new (sole) 
compound item. This extends to a = 0.5 if we prefer to merge 
combined nodes over single items of the same weight. Thus, 
any finite input distribution can be optimally coded for a ≤ 0.5
using a truncated unary code, a truncated version of the unary
code, the latter of which has codewords of the form $\{1^j 0 : j \geq 0\}$.
The truncated unary code has identical codewords as the
unary code except for the longest codeword, which is of the
form $1^{|\mathcal{X}|-1}$. This results from each compound node being
formed using at least one single item (leaf). Taking limits,
informally speaking, results in a unary limit code. Formally,
this is a straightforward corollary of Theorem 2 in Section IV.

If a > 0.5, a code with finite penalty exists if and only
if Rényi entropy of order $\alpha(a) = (1 + \log_2 a)^{-1}$ is finite, as
shown in [16]. It was Campbell who first noted the connection
between the optimal code's penalty, $L_a(P, N^*)$, and Rényi
entropy

$$H_\alpha(P) \triangleq \frac{1}{1 - \alpha} \log_2 \sum_{i \in \mathcal{X}} p(i)^\alpha$$

$$\Rightarrow H_{\alpha(a)}(P) = \frac{1 + \log_2 a}{\log_2 a} \log_2 \sum_{i \in \mathcal{X}} p(i)^{(1 + \log_2 a)^{-1}}.$$

This relationship is

$$H_{\alpha(a)}(P) \leq L_a(P, N^*) < H_{\alpha(a)}(P) + 1$$

which should not be surprising given the similar relationship
between Huffman-optimal codes and Shannon entropy [24],
which corresponds to $a \to 1$ ($\alpha \to 1$) [8], [25]; due to this
correspondence, Shannon entropy is sometimes expressed as
$H_1(P)$.
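The entropy bounds can be illustrated on a small example. In the sketch below the function names and the distribution are hypothetical, and the length vector is assumed (not proven here) to be optimal for a = 2; the lower bound holds for any uniquely decipherable code, while the upper bound relies on that optimality assumption.

```python
import math

def alpha_of(a):
    """Renyi order alpha(a) = 1 / (1 + log2 a); requires a > 0.5 and a != 1."""
    return 1.0 / (1.0 + math.log2(a))

def renyi_entropy(probs, alpha):
    """H_alpha(P) = log2(sum_i p(i)**alpha) / (1 - alpha), for alpha != 1."""
    return math.log2(sum(p ** alpha for p in probs)) / (1.0 - alpha)

def exponential_penalty(probs, lengths, a):
    """L_a(P, N) = log_a sum_i p(i) * a**n(i)."""
    return math.log(sum(p * a ** n for p, n in zip(probs, lengths)), a)

# Hypothetical example; lengths assumed optimal for a = 2.
probs = [0.5, 0.25, 0.125, 0.125]
lengths = [1, 2, 3, 3]
a = 2.0
H = renyi_entropy(probs, alpha_of(a))
L = exponential_penalty(probs, lengths, a)
print(H, L, H + 1)

# As a -> 1 the Renyi entropy approaches the Shannon entropy (1.75 bits here).
print(renyi_entropy(probs, alpha_of(1.0001)))
```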

III. Geometric Distribution with Exponential 
Penalty 

Consider the geometric distribution

$$p_\theta(i) = (1 - \theta)\theta^i$$

for parameter $\theta \in (0, 1)$. This distribution arises in run-length
coding among other circumstances [3], [4].

For the traditional linear penalty, a Golomb code with
parameter k (or $G_k$) is optimal for $\theta^k + \theta^{k+1} \leq 1 < \theta^{k-1} + \theta^k$.
Such a code consists of a unary prefix followed
by a binary suffix, the latter taking one of k possible values.
If k is a power of two, all binary suffix possibilities have the
same length; otherwise, their lengths $\sigma(i)$ differ by at most 1
and $\sum_i 2^{-\sigma(i)} = 1$. Binary codes such as these suffix codes
are called complete codes. This defines the Golomb code; for
example, the Golomb code for k = 3 is:



i    p_θ(i)         c(i)
0    (1 - θ)        0 0
1    (1 - θ)θ       0 10
2    (1 - θ)θ^2     0 11
3    (1 - θ)θ^3     10 0
4    (1 - θ)θ^4     10 10
5    (1 - θ)θ^5     10 11
6    (1 - θ)θ^6     110 0
7    (1 - θ)θ^7     110 10
8    (1 - θ)θ^8     110 11
9    (1 - θ)θ^9     1110 0



where the space in the code separates the unary prefix from
the complete suffix. In general, codeword j for $G_k$ is of the
form $\{1^{\lfloor j/k \rfloor} 0\, b(j \bmod k, k) : j \geq 0\}$, where $b(j \bmod k, k)$ is
a complete binary code for the $(j - k\lfloor j/k \rfloor + 1)$th of k items.
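The codeword form above can be sketched in Python. The complete suffix code used here is the common truncated-binary assignment, which is one valid choice of complete code and an assumption of this example; the function names are hypothetical.

```python
def complete_suffix(r, k):
    """Codeword for the r-th of k items (0-based) in a complete binary code.

    With g = floor(log2 k) + 1 and z = 2**g - k, the first z items get
    g - 1 bits and the remaining k - z items get g bits (truncated binary).
    """
    g = k.bit_length()          # equals floor(log2 k) + 1 for k >= 1
    z = (1 << g) - k
    if r < z:
        return format(r, "b").zfill(g - 1) if g > 1 else ""
    return format(r + z, "b").zfill(g)

def golomb_codeword(j, k):
    """Golomb codeword 1^{floor(j/k)} 0 b(j mod k, k) for an integer j >= 0."""
    q, r = divmod(j, k)
    return "1" * q + "0" + complete_suffix(r, k)

# Codewords for k = 3, printed with a space between prefix and suffix.
for j in range(10):
    q, r = divmod(j, 3)
    print(j, "1" * q + "0", complete_suffix(r, 3))
```

For k = 1 the suffix is empty and the code reduces to the unary code.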

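It turns out that such codes are optimal for the exponential
penalty:

Theorem 1: For $a \in \mathbb{R}_+$, if

$$\theta^k + \theta^{k+1} \leq \frac{1}{a} < \theta^{k-1} + \theta^k \tag{2}$$

for k ≥ 1, then the Golomb code $G_k$ is the optimal code for
$P_\theta$. If no such k exists, the unary code $G_1$ is optimal.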

Remark: This rule for finding an optimal Golomb $G_k$ code
is equivalent to

$$k = \max\left(1, \left\lceil -\log_\theta a - \log_\theta(1 + \theta) \right\rceil\right).$$

This is a generalization of the traditional linear result, which
corresponds to $a \to 1$. Cases in which the left inequality is
an equality have multiple solutions, as with linear coding; see,
e.g., [26, p. 289]. The proof of the optimality of the Golomb
code for exponential penalties is somewhat similar to that
of [4], although it must be significantly modified due to the
nonlinearity involved.
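The rule in the Remark is easy to evaluate. The following sketch, with hypothetical function names, computes k from θ and a and checks it against the two-sided condition of the theorem; floating-point rounding may matter for parameters that sit exactly on a boundary.

```python
import math

def golomb_parameter(theta, a):
    """k = max(1, ceil(-log_theta(a) - log_theta(1 + theta)))."""
    x = -math.log(a * (1 + theta), theta)
    return max(1, math.ceil(x))

def satisfies_condition(theta, a, k):
    """Check theta**k + theta**(k+1) <= 1/a < theta**(k-1) + theta**k."""
    return theta ** k * (1 + theta) <= 1 / a < theta ** (k - 1) * (1 + theta)

# Hypothetical parameter choices: the optimal k grows with a for fixed theta.
for a in (0.5, 0.7, 1.0, 1.2):
    print(a, golomb_parameter(0.9, a))
```

For θ = 0.9 the linear case a = 1 gives k = 7, while a = 0.5 falls into the degenerate unary case k = 1.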

Before proving Theorem 1, we need the following lemma:

Lemma 2: Consider a Huffman combining procedure, such
as the exponential Huffman coding procedure, implemented
using the two-queue method presented in the previous section
just prior to Lemma 1. Now consider a step at which the
first (single-item) queue is empty, so that remaining are only
compound items, that is, items representing internal nodes
rather than leaves in the final Huffman coding tree. Then, in
this final tree, the nodes corresponding to these compound
items will be on levels differing by at most one; that is, the
nodes will form a complete tree. Furthermore, if n is the
number of items remaining at this point, all items that finish
at level $\lceil \log_2 n \rceil$ appear closer to the head of the (second,
nonempty) queue than any item at level $\lceil \log_2 n \rceil - 1$ (if any).

Proof (Lemma 2): We use an inductive proof, in which
the base cases of one and two compound items (i.e., internal
nodes) are trivial. Suppose the lemma is true for every case
with n − 1 items for n > 2, that is, that all nodes are at levels
$\lfloor \log_2(n - 1) \rfloor$ or $\lceil \log_2(n - 1) \rceil$, with the latter items closer to
the head of the queue than the former. Consider now a case
with n nodes. The first step of coding is to merge two nodes,
resulting in a combined item that is placed at the end of the
combined-item queue. Because it is at the end of the queue
in the reduced problem of size n − 1, this combined node is
at level $\lfloor \log_2(n - 1) \rfloor$ in the final tree, and its children are
at level $1 + \lfloor \log_2(n - 1) \rfloor = \lceil \log_2 n \rceil$. If n is a power of
two, the remaining items end up on level $\log_2 n = \lceil \log_2(n - 1) \rceil$,
satisfying this lemma. If n − 1 is a power of two, they
end up on level $\log_2(n - 1) = \lfloor \log_2 n \rfloor$, also satisfying the
lemma. Otherwise, there is at least one item ending up at level
$\lceil \log_2 n \rceil = \lceil \log_2(n - 1) \rceil$ near the head of the queue, followed
by the remaining items, which end up at level $\lfloor \log_2 n \rfloor = \lfloor \log_2(n - 1) \rfloor$.
In any case, the lemma is satisfied for n items,
and thus, inductively, for any number of items. ■
This lemma applies to any problem in which a two-queue 
Huffman algorithm provides an optimal solution, including 
the original Huffman problem and the tree-height problem of 
[19]. Here we apply the lemma to the exponential Huffman 
algorithm to prove Theorem 1.

Proof (Theorem 1): We start with an optimal exponential
Huffman code for a sequence of similar finite weight distributions.
These finite weight distributions, called m-reduced
geometric sources $W_m$, are defined as:

$$w_m(i) \triangleq \begin{cases} (1 - \theta)\theta^i, & 0 \leq i \leq m \\ \dfrac{(1 - \theta)\, a\, \theta^i}{1 - a\theta^k}, & m < i \leq m + k \end{cases}$$

where k is as given in the statement of the theorem, or 1 if
no such k exists.

Fig. 1. Formation of a Golomb code using a code for an m-reduced source. In this illustration, m = 17 and k = 5, and smaller weights are pictorially
lower. Weights are merged bottom-up, in a manner consistent with the exponential Huffman algorithm, first in separate (truncated) unary subtrees, then in a
(five-leaf) complete tree.

Weights $w_m(0)$ through $w_m(m)$ are decreasing, as are
$w_m(m + 1)$ through $w_m(m + k)$. Thus we can combine the
nodes with weights $w_m(m)$ and $w_m(m + k)$ if

$$\frac{(1 - \theta)\, a\, \theta^{m+k}}{1 - a\theta^k} \leq (1 - \theta)\theta^{m-1}$$

and

$$\frac{(1 - \theta)\, a\, \theta^{m+k-1}}{1 - a\theta^k} > (1 - \theta)\theta^m \quad \text{or } k = 1.$$

These conditions are equivalent to the left and right sides,
respectively, of (2). Thus the combined item is

$$w_{m-1}(m) = \frac{(1 - \theta)\, a\, \theta^m}{1 - a\theta^k}$$

and the code is reduced to the $W_{m-1}$ case.

After merging the two smallest weights for m = 0, the
reduced source is

$$w_{-1}(i) = \frac{(1 - \theta)\, a\, \theta^i}{1 - a\theta^k}, \quad 0 \leq i \leq k - 1.$$

For k = 1 (including all instances of the degenerate a ≤
0.5 case and all instances in which (2) cannot be satisfied),
this proves that the optimal tree is the truncated unary tree.
Considering now only k > 1 for m > k − 1, the two-queue
algorithm assures that, when the problem is reduced to weights
$\{w_{-1}(i)\}$, all corresponding nodes are in the combined-item
queue. Lemma 2 thus proves that these nodes form a complete
code. The overall optimal tree for any m-reduced code with
m > k − 1 is then a truncated Golomb tree, as pictorially
represented in Fig. 1, where m = 17 and k = 5. Note that
m + 1 is the number of leaves in common with what we call
the "Golomb tree," the tree we show to be optimal for the
original geometric source. The number of remaining leaves in
the truncated tree is k, which is thus the number of distinct
unary subtrees in the Golomb tree.

Fig. 1 represents both the truncated and full Golomb trees,
along with how to merge the weights. Squares represent items
to code, while circles represent other nodes of the tree. Smaller
weights are below larger ones, so that items are merged as
pictured. Rounded squares are items m + 1 through m + k,
the items which are replaced in the Golomb tree by unary
subtrees, that is, subtrees representing the unary code. Other
squares are items 0 through m, those corresponding to single
items in the integer code. White circles are the leaves used for
the complete tree.

Fig. 2. Redundancy of the optimal code for the geometric distribution with the exponential penalty (parameter a), for (a) a > 1 and (b) a < 1. $R_a(N^*_\theta, P_\theta) = L_a(P_\theta, N^*_\theta) - H_{\alpha(a)}(P_\theta)$,
where $\alpha(a) = (1 + \log_2 a)^{-1}$, $P_\theta$ is the geometric probability sequence implied by θ, and $N^*_\theta$ is the optimal length sequence for distribution
$P_\theta$ and parameter a.



It is equivalent to follow the complete portion of the code
with the unary portion, as in the exponential Huffman tree
in Fig. 1, or to reorder the bits and follow the unary portion
with the complete portion, as in the Golomb code [3]. The
latter is more often used in practice and has the advantage
of being alphabetic, that is, i > j if and only if c(i) is
lexicographically after c(j).

The truncated Golomb tree for any m > k − 1 represents a
code that has the same penalty for the m-reduced distribution
as does the Golomb code with the corresponding geometric
distribution. We now show that this is the minimum penalty
for any code with this geometric distribution.

Let $N^*_{\theta,a}$ (or $N^*$ if there is no ambiguity) be codeword
lengths that minimize the penalty for the geometric distribution
(which, as we noted, exist as shown in Theorem 2 of [16]).
Let $N_m$ be codeword lengths for the m-reduced distribution
found earlier; that is, $n_m(i)$ is the Golomb length for $i \leq m$
and $n_m(i) = n_m(i - k)$ for the remaining values. Finally, let
$N_\infty$ be the lengths of the code implied by $m \to \infty$, that is,
the lengths of the Golomb code $G_k$. Then

$$\begin{aligned}
\log_a \sum_{i=0}^{\infty} p(i)\, a^{n^*(i)} &\leq \log_a \sum_{i=0}^{\infty} p(i)\, a^{n_\infty(i)} \\
&= \log_a \sum_{i=0}^{m+k} w_m(i)\, a^{n_m(i)} \\
&\leq \log_a \sum_{i=0}^{m+k} w_m(i)\, a^{n^*(i)}
\end{aligned} \tag{3}$$

where the inequalities are due to the optimality of the respective
codes and the facts that $w_m(i) = p(i)$ for $i \leq m$ and

$$w_m(i) = \sum_{j=0}^{\infty} (1 - \theta)\theta^{i+jk} a^{j+1} = \sum_{j=0}^{\infty} a^{j+1} p(i + jk)$$

for $i \in (m, m + k]$. The difference between a raised to the
first and a raised to the last of the expressions in (3) is

$$\sum_{i=0}^{\infty} p(i)\, a^{n^*(i)} - \sum_{i=0}^{m+k} w_m(i)\, a^{n^*(i)} = \sum_{i=m+1}^{\infty} p(i)\, a^{n^*(i)} - \sum_{i=m+1}^{m+k} w_m(i)\, a^{n^*(i)}.$$

As $m \to \infty$ for m > k − 1, the sums on the right-hand
side approach 0; the first is the difference between a
limit (an infinite sum) and its approaching sequence of finite
sums, all upper bounded in (3), and each of the terms in
the second summation is upper-bounded by a multiplicative
constant of the corresponding term in the first. (In the latter
finite summation, terms are 0 for i > m + k.) Their difference
therefore also approaches zero, so the summations on the
left-hand side approach equality, as do those in (3), and the
Golomb code must be optimal. ■

It is equivalent for the bits of the unary portion to be
complemented, that is, to use $\{0^{\lfloor j/k \rfloor} 1\, b(j \bmod k, k) : j \geq 0\}$
(as in [4]) instead of $\{1^{\lfloor j/k \rfloor} 0\, b(j \bmod k, k) : j \geq 0\}$ (as in
[3]). It is also worth noting that Golomb originally proposed
his code in the context of a spy reporting run lengths; this
is similar to Rényi's context for communications, related in
Section I as a motivation for the nonlinear penalty with a < 1.

A little algebra reveals that, for a distribution $P_\theta$ and a
Golomb code with parameter k (lengths $N_k$),

$$L_a(P_\theta, N_k) = \log_a \sum_{i=0}^{\infty} (1 - \theta)\theta^i a^{\lfloor (i - z)/k \rfloor + 1 + g} = g + \log_a\left(1 + \frac{(a - 1)\theta^z}{1 - a\theta^k}\right) \tag{4}$$

where $g = \lfloor \log_2 k \rfloor + 1$ and $z = 2^g - k$. Therefore,
Theorem 1 provides the k that minimizes (4). If a > 0.5,
the corresponding Rényi entropy is

$$H_{\alpha(a)}(P_\theta) = \log_a \frac{1 - \theta}{\left(1 - \theta^{\alpha(a)}\right)^{1/\alpha(a)}} \tag{5}$$

where we recall that $\alpha(a) = (1 + \log_2 a)^{-1}$. (Again, a ≤ 0.5 is
degenerate, an optimal code being unary with no corresponding
Rényi entropy.)

Fig. 3. Redundancy of the optimal code for the geometric distribution with
the traditional linear penalty.
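The closed form for the Golomb-code penalty can be checked against a direct (truncated) summation. The sketch below uses hypothetical function names and parameter values, assumes $a\theta^k < 1$ so that the penalty is finite, and also confirms on one example that the k given by the theorem minimizes the closed form over a small range of k.

```python
import math

def golomb_length(i, k):
    """Length floor((i - z)/k) + 1 + g of Golomb codeword i, with g, z as in the text."""
    g = k.bit_length()              # floor(log2 k) + 1
    z = (1 << g) - k
    return (i - z) // k + 1 + g     # // floors toward minus infinity

def penalty_closed_form(theta, a, k):
    """g + log_a(1 + (a - 1) * theta**z / (1 - a * theta**k)); needs a * theta**k < 1."""
    g = k.bit_length()
    z = (1 << g) - k
    return g + math.log(1 + (a - 1) * theta ** z / (1 - a * theta ** k), a)

def penalty_by_summation(theta, a, k, terms=5000):
    """log_a of the first `terms` terms of sum_i (1 - theta) theta**i a**n(i)."""
    total = sum((1 - theta) * theta ** i * a ** golomb_length(i, k)
                for i in range(terms))
    return math.log(total, a)

# Hypothetical parameters.
theta, a = 0.9, 0.7
for k in range(1, 7):
    print(k, penalty_closed_form(theta, a, k))
print(penalty_by_summation(theta, a, 3))
```

For θ = 0.9 and a = 0.7, the printed penalties are smallest at k = 3, matching the parameter obtained from the theorem.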

In evaluating the effectiveness of the optimal code, one
might use the following definition of average pointwise redundancy
(or just redundancy):

$$R_a(N, P) \triangleq L_a(P, N) - H_{\alpha(a)}(P).$$

For nondegenerate values, we can plot the $R_a(N^*_{\theta,a}, P_\theta)$
obtained from the minimization. This is done for a > 1 and
a < 1 in Fig. 2. Note that as $a \to 1$, the plot approaches the
redundancy plot for the linear case, e.g., [4], reproduced as
Fig. 3.

In many potential applications of nonlinear penalties,
such as the aforementioned for a > 1 [10]-[12] and a < 1
(Section I), a is very close to 1. Since the preceding analysis
shows that the Golomb code that is optimal for given a and θ
is optimal not only for these particular values, but for a range
of a (fixing θ) and a range of θ (fixing a), the Golomb code
for the traditional linear penalty is, in some sense, much more
robust and general than previously appreciated.

IV. Other Infinite Sources 

Abrahams noted that, in the linear case, slight deviation 
from the geometric distribution in some cases does not change 
the optimal code [27, Proposition (2)]. Other extensions to 
and deviations of the geometric distribution have also been 
considered [28]-[30], including optimal codes for nonbinary 
alphabets [27], [29]. Many of these approaches can be adapted 
to the nonlinear penalties considered here. However, in this 
section we instead consider another type of probability distri- 
bution for binary coding, the type with a light tail. 

Humblet's approach [31], later extended in [32], uses the
fact that there is an optimal code tree with a unary subtree
for any probability distribution with a relatively light tail, one
for which there is an r such that, for all j > r and i < j,
$p(i) \geq p(j)$ and $p(i) \geq \sum_{k=j+1}^{\infty} p(k)$. Due to the additive
nature of Huffman coding, items beyond r form the unary
subtree, while the remaining tree can be coded via the Huffman
algorithm. Once again, this has to be modified for exponential
penalties.

We wish to show that the optimal code can be obtained
when there is a nonnegative integer r such that, for all j > r
and i < j,

$$p(i) \geq \max\left( p(j),\ \sum_{k=j+1}^{\infty} p(k)\, a^{k-j} \right).$$

The optimal code is obtained by considering the reduced 
alphabet consisting of symbols 0, 1, . . . , r + 1 with weights 

W{t) = \EZ r+ Mk)a k - r , i = r+l. (6) 

Apply exponential Huffman coding to this reduced set of
weights. For items 0 through r, the Huffman codewords for
the reduced and the infinite alphabets are identical. Each other
item i > r has a codeword consisting of the reduced codeword
for item r + 1 (which, without loss of generality, consists of all
1's) followed by the unary code for i − r − 1, that is, i − r − 1
ones followed by a zero. We call such codes unary-ended. A
pictorial example is shown in Fig. 4 for a problem instance
for which r = 12.
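The construction just described can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names (`exp_huffman_lengths`, `unary_ended_lengths`, `poisson`) are hypothetical, the tail weight of (6) is approximated by a truncated sum (an assumption that suits light tails), and the tie-breaking rule is a choice made here so that equally likely items receive nondecreasing lengths.

```python
import heapq
import itertools
import math


def exp_huffman_lengths(weights, a):
    """Codeword lengths from exponential Huffman coding.

    The two smallest weights w_j, w_k are repeatedly merged into a node of
    weight a * (w_j + w_k); a = 1 gives ordinary Huffman coding.
    """
    # Leaves use tie key -i so that, among equal weights, the later item is
    # merged first (and gets the longer codeword); merged nodes use 1, 2, ...
    heap = [(w, -i, [i]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    merged = itertools.count(1)
    lengths = [0] * len(weights)
    while len(heap) > 1:
        w1, _, leaves1 = heapq.heappop(heap)
        w2, _, leaves2 = heapq.heappop(heap)
        for leaf in leaves1 + leaves2:
            lengths[leaf] += 1  # each leaf under the new node is one bit deeper
        heapq.heappush(heap, (a * (w1 + w2), next(merged), leaves1 + leaves2))
    return lengths


def unary_ended_lengths(p, r, a, count, tail_terms=60):
    """First `count` codeword lengths of the unary-ended code.

    p is a callable returning p(i); the weight of item r + 1 in (6) is
    approximated by its first `tail_terms` terms.
    """
    tail = sum(p(k) * a ** (k - r) for k in range(r + 1, r + 1 + tail_terms))
    reduced = exp_huffman_lengths([p(i) for i in range(r + 1)] + [tail], a)
    # Items 0..r keep their reduced lengths; item i > r appends the unary code
    # for i - r - 1 (that many ones plus a zero) to the codeword of item r + 1.
    return [reduced[i] if i <= r else reduced[r + 1] + (i - r)
            for i in range(count)]


def poisson(lam):
    """Poisson probability mass function with parameter lam."""
    return lambda i: math.exp(-lam) * lam ** i / math.factorial(i)


print(unary_ended_lengths(poisson(1.0), r=2, a=1, count=6))  # [1, 2, 3, 4, 5, 6]
print(unary_ended_lengths(poisson(1.0), r=2, a=2, count=6))  # [2, 2, 2, 3, 4, 5]
```

The two calls use the Poisson example with λ = 1 and r = 2 treated in this section, and reproduce the codeword lengths reported there for a = 1 and a = 2.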

Theorem 2: Let p(·) be a probability measure on the set of
nonnegative integers, and let a be the parameter of the penalty
to be optimized. If there is a nonnegative integer r such that
for all j > r and i < j,

$$p(i) \geq p(j) \qquad (7)$$

and

$$p(i) \geq \sum_{k=j+1}^{\infty} p(k)\, a^{k-j} \qquad (8)$$

then there exists a minimum-penalty binary prefix code with
every codeword j > r consisting of j − x 1's followed by one 0
for some fixed nonnegative integer x.

Proof: The idea here is similar to that for geometric
distributions, to show a sequence of finite codes which in some
sense converges to the optimal code for the infinite alphabet. In
this case we consider the infinite sequence of codes implicit in
the above; for a given m ≥ −1, the corresponding codeword
weights are

$$w_m(i) = \begin{cases} p(i), & i < r + m + 2 \\ \sum_{k=r+m+2}^{\infty} p(k)\, a^{k-r-m-1}, & i = r + m + 2. \end{cases}$$

It is obvious that an optimal code for each m-reduced distribution
is identical to the proposed code for the infinite alphabet,
except for the item r + m + 2, which is the code tree sibling
of item r + m + 1.

For a < 1, we show, as in the geometric case, that
the difference between the penalties for the optimal and the
proposed codes approaches 0. In this case, the equivalent of
inequality (3) is

$$\begin{aligned}
\log_a \sum_{i=0}^{\infty} p(i)\, a^{n^*(i)}
  &\leq \log_a \sum_{i=0}^{\infty} p(i)\, a^{n_\infty(i)} \\
  &= \log_a \sum_{i=0}^{r+m+2} w_m(i)\, a^{n_m(i)} \\
  &\leq \log_a \sum_{i=0}^{r+m+2} w_m(i)\, a^{n^*(i)}
\end{aligned} \qquad (9)$$

where in this case $n_\infty(i)$ denotes the length of codeword i of the
proposed code, $n_m(i) = n_\infty(i)$ for i < r + m + 2 and
$n_m(i) = n_\infty(i-1)$ for i = r + m + 2, and, again, $n^*(\cdot)$ denotes the
lengths of codewords in an optimal code. The corresponding difference
between the arguments of the logarithm in the first and the last
expressions of (9) is

$$\begin{aligned}
\sum_{i=0}^{\infty} p(i)\, a^{n^*(i)} &- \sum_{i=0}^{r+m+2} w_m(i)\, a^{n^*(i)} \\
  &= \sum_{i=r+m+2}^{\infty} p(i)\, a^{n^*(i)} - w_m(r+m+2)\, a^{n^*(r+m+2)}.
\end{aligned} \qquad (10)$$

As m → ∞, both terms in the difference on the second line of
(10) clearly approach 0, so the terms in (9) approach equality,
showing the proposed code to be optimal.

Fig. 4. Formation of a unary-ended infinite code using a Huffman-like code. (Smaller weights are pictorially lower.) Weights are merged bottom-up, in a
manner consistent with the exponential Huffman algorithm, first in the (truncated) unary subtree, then as in the exponential Huffman algorithm.

For a > 1, the same method will work, but it is not so
obvious that the terms in the difference on the second line of
(10) approach 0. Let us first find an upper bound for
$w_m(r+m+2)$ in terms of $p(r+m+2)$:

$$\begin{aligned}
w_m(r+m+2) &= \sum_{i=r+m+2}^{\infty} p(i)\, a^{i-r-m-1} \\
  &= a\,p(r+m+2) + a^2 p(r+m+3) + \sum_{i=r+m+4}^{\infty} p(i)\, a^{i-r-m-1} \\
  &\leq (a^2 + a)\, p(r+m+2) + a^2 p(r+m+3) \\
  &\leq (2a^2 + a)\, p(r+m+2)
\end{aligned}$$

where the first equality is due to the definition of $w_m(\cdot)$, the
first inequality due to (8), and the second inequality due to (7).
Thus $w_m(r+m+2)$ has an upper bound of $(2a^2 + a)\,p(r+m+2)$
for all m ≥ −1. In addition, since the proposed code
has a finite penalty (identical to that of any reduced code),
the optimal code has a finite penalty, and the sequence
of its terms, each one of which has the form
$p(r+m+2)\, a^{n^*(r+m+2)}$, approaches 0 as m increases. Thus
$w_m(r+m+2)\, a^{n^*(r+m+2)}$ approaches 0 as well. Due to the optimality
of $n^*(\cdot)$, $w_m(r+m+2)\, a^{n^*(r+m+2)}$ serves as an upper bound
for $\sum_{i=r+m+2}^{\infty} p(i)\, a^{n^*(i)}$, and thus both terms approach 0.
As with a < 1, then, the terms in (9) approach equality for
m → ∞, showing the proposed code to be optimal. ■
The rate at which p(·) must decrease in order to satisfy
condition (8) clearly depends on a. One simple sufficient
condition, provable via induction, is that it satisfy
$p(i) \geq a\,p(i+1) + a\,p(i+2)$ for large i. A less general condition is
that p(i) eventually decrease at least as quickly as $g^i$, where
$g = (\sqrt{1 + 4/a} - 1)/2$, the same ratio needed for a unary
geometric code for θ = g, as in (2). The ratio g is plotted in
Fig. 5.

For a → 1, these conditions approach those derived in [31].

For the Poisson example with λ = 1 treated in this section (for
which r = 2 and, for a = 2, $w_{-1}(3) \approx 0.2197$), applying the
appropriate Huffman procedure to each reduced source of 4 weights
shows that the optimal code for a = 1 has lengths
N = {1, 2, 3, 4, 5, 6, ...}, those of the unary code, while the
optimal code for a = 2 has lengths N = {2, 2, 2, 3, 4, 5, ...}.

It is worthwhile to note that these techniques are easily
extensible to finding an optimal alphabetic code (that is, one
with c(i)'s arranged in lexicographical order) for a > 1. One
needs only to find the optimal alphabetic code for the reduced
code with weights given in equation (6), as in [18], with
codewords for i > r consisting of the reduced code's codeword
for r + 1 followed by i − r − 1 ones and one zero. As previously
mentioned, Golomb codes are also alphabetic and thus are
optimal alphabetic codes for the geometric distribution.



Fig. 5. Ratio g, probability distribution fall-off sufficient for the optimality
of a unary-ended code. Note that $1/g = \Phi = \frac{1}{2}(1 + \sqrt{5})$, the golden ratio,
at a = 1.
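The fall-off ratio g plotted in Fig. 5 is easy to check numerically. The short sketch below (the helper name `unary_ratio` is hypothetical) confirms that g is the positive root of a·g + a·g² = 1, so that a geometric sequence with ratio g meets the sufficient condition p(i) ≥ a·p(i+1) + a·p(i+2) with equality, and that 1/g is the golden ratio at a = 1.

```python
import math


def unary_ratio(a):
    """Fall-off ratio g = (sqrt(1 + 4/a) - 1) / 2 quoted in the text."""
    return (math.sqrt(1 + 4 / a) - 1) / 2


for a in (0.5, 1.0, 2.0, 4.0):
    g = unary_ratio(a)
    # The last column should be 1 up to rounding error.
    print(a, g, a * g + a * g * g)

golden = (1 + math.sqrt(5)) / 2
print(1 / unary_ratio(1.0), golden)  # the two values agree
```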



The stronger results of [32] do not easily extend here due 
to the nonadditivity of the exponential penalty. An attempt at 
such an extension in [33, pp. 103-105] gives no criteria for 
success, so that, while one could produce certain codewords 
for certain codes, one might fail in producing other codewords 
for the same codes or for other codes. Thus this extension is 
not truly a workable algorithm. 

Consider the example of optimal codes for the Poisson
distribution,

$$p_\lambda(i) = \frac{\lambda^i e^{-\lambda}}{i!}.$$

How does one find a suitable value for r (as in Section IV) in
such a case? It has been shown that r ≥ ⌈eλ⌉ − 1 yields p(i) ≥
p(j) for all j > r and i < j, satisfying the first condition of
Theorem 2 [31]. Moreover, if, in addition, j ≥ ⌈2aλ⌉ − 1 (and
thus j > aλ − 1), then

$$\begin{aligned}
\sum_{k=1}^{\infty} p(j+k)\, a^k
  &= p(j) \left[ \frac{a\lambda}{j+1} + \frac{a^2\lambda^2}{(j+1)(j+2)} + \cdots \right] \\
  &< p(j) \left[ \frac{a\lambda}{j+1} + \frac{a^2\lambda^2}{(j+1)^2} + \cdots \right] \\
  &= p(j)\, \frac{a\lambda/(j+1)}{1 - a\lambda/(j+1)} \\
  &\leq p(j) \\
  &\leq p(i).
\end{aligned}$$

Thus, since we consider j > r, r = max(⌈2aλ⌉ − 2, ⌈eλ⌉ − 1)
is sufficient to establish an r such that the above method yields
the optimal infinite-alphabet code.
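As a sanity check on this choice of r, the sketch below evaluates conditions (7) and (8) numerically for a Poisson distribution. It is illustrative only: the helper names are hypothetical, the infinite sum in (8) is truncated to `tail_terms` terms, and only j up to `j_max` is examined, so a `True` result is evidence rather than proof.

```python
import math


def poisson_pmf(lam, i):
    return math.exp(-lam) * lam ** i / math.factorial(i)


def poisson_r(lam, a):
    """r = max(ceil(2*a*lam) - 2, ceil(e*lam) - 1), as stated in the text."""
    return max(math.ceil(2 * a * lam) - 2, math.ceil(math.e * lam) - 1)


def conditions_hold(lam, a, r, j_max=40, tail_terms=80):
    """Check (7) and (8) for r < j <= j_max and all i < j (truncated tail)."""
    p = [poisson_pmf(lam, i) for i in range(j_max + tail_terms + 1)]
    for j in range(r + 1, j_max + 1):
        tail = sum(p[k] * a ** (k - j) for k in range(j + 1, j + 1 + tail_terms))
        smallest = min(p[:j])  # the binding case among all i < j
        if smallest < p[j] or smallest < tail:
            return False
    return True


print(poisson_r(1.0, 1), poisson_r(1.0, 2))   # both give r = 2
print(conditions_hold(1.0, 2, poisson_r(1.0, 2)))
print(conditions_hold(1.0, 2, 0))             # r = 0 is too small for a = 2
```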

In order to find the optimal reduced code, use

$$w_{-1}(r+1) = \sum_{k=r+1}^{\infty} p(k)\, a^{k-r} = a^{-r} e^{\lambda(a-1)} - \sum_{k=0}^{r} p(k)\, a^{k-r}.$$
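The closed form for the weight of item r + 1 can be compared against a direct (truncated) evaluation of the defining sum. The following sketch uses hypothetical helper names and a 60-term truncation, which is an assumption that is ample for small λ.

```python
import math


def pmf(lam, k):
    return math.exp(-lam) * lam ** k / math.factorial(k)


def tail_weight_closed_form(lam, a, r):
    """a**(-r) * exp(lam*(a-1)) minus the first r + 1 terms of the series."""
    head = sum(pmf(lam, k) * a ** (k - r) for k in range(r + 1))
    return a ** (-r) * math.exp(lam * (a - 1)) - head


def tail_weight_direct(lam, a, r, terms=60):
    """Truncated version of the defining sum over k > r."""
    return sum(pmf(lam, k) * a ** (k - r) for k in range(r + 1, r + 1 + terms))


for a in (1, 2):
    print(a, tail_weight_closed_form(1.0, a, 2), tail_weight_direct(1.0, a, 2))
```

For λ = 1 and r = 2 both routes give approximately 0.0803 for a = 1 and 0.2197 for a = 2.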



For example, consider the Poisson distribution with λ = 1. We
code this for both a = 1 and a = 2. For both values, r = 2,
so both are easy to code. For a = 1, $w_{-1}(3) = 1 - 2.5e^{-1} \approx 0.0803$,
while, for a = 2, $w_{-1}(3) = 0.25e - 1.25e^{-1} \approx 0.2197$.



V. Application: Buffer Overflow 

The application of the exponential penalty in [11] concerns 
minimizing the probability of a buffer overflowing. It requires 
that each candidate code for overall optimality be an optimal 
code for one of a series of exponential parameters (a's where 
a > 1). An iterative approach yields a final output code by 
noting that, for the overall utility function, each candidate code 
is no worse than its predecessor, and there are a finite number 
of possible candidate codes. Therefore, eventually a candidate 
code yields the same value as the prior candidate code, and 
this can be shown to be the optimal code. This application of 
exponential Huffman coding can, using the above techniques, 
be extended to infinite alphabets. 

In the application, integers with a known distribution P 
arrive with independent intermission times having a known 
probability density function. Encoded bits are sent at a given 
rate, with bits to be sent waiting in a buffer of fixed size. 
Constant b represents the buffer size in bits, random variable 
T represents the probability distribution of source integer in- 
termission times measured in units of encoded bit transmission 
time, and function A(s) is the Laplace-Stieltjes transform of
T, $E[e^{-sT}]$.

When the integers are coded using N = {n(i)}, the
probability per input integer of buffer overflow is of the order
of $e^{-s^* b}$, where s* is the largest s such that

$$f(N, s) \leq 1$$

where

$$f(N, s) \triangleq A(s) \sum_{i \in \mathcal{X}} p(i)\, e^{s\, n(i)}. \qquad (11)$$



The previously known algorithm to maximize s* is as
follows:

Procedure for Finding Code with Largest s* [11]

1) Choose any $s_0 \in \mathbb{R}_+$.

2) $j \leftarrow 0$.

3) $j \leftarrow j + 1$.

4) Find codeword lengths $N_j$ minimizing $\sum_{i \in \mathcal{X}} p(i)\, e^{s_{j-1} n(i)}$.

5) Compute $s_j = \max\{s \in \mathbb{R} : f(N_j, s) \leq 1\}$.

6) If $s_j \neq s_{j-1}$ then go to step 3; otherwise stop.
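For a finite alphabet, the six steps above can be sketched as follows. This is an illustration under stated assumptions rather than the algorithm of [11] verbatim: the function names are hypothetical, step 5 is carried out by bisection (which assumes f eventually exceeds 1 as s grows), step 6 compares successive values of s up to a tolerance instead of testing exact equality, and the usage example assumes exponentially distributed intermission times with mean 4, for which A(s) = 0.25 / (0.25 + s).

```python
import heapq
import itertools
import math


def exp_huffman_lengths(weights, a):
    """Lengths minimizing sum_i w_i * a**n_i (exponential Huffman coding)."""
    tie = itertools.count()  # unique keys so the heap never compares lists
    heap = [(w, next(tie), [i]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    lengths = [0] * len(weights)
    while len(heap) > 1:
        w1, _, l1 = heapq.heappop(heap)
        w2, _, l2 = heapq.heappop(heap)
        for leaf in l1 + l2:
            lengths[leaf] += 1
        heapq.heappush(heap, (a * (w1 + w2), next(tie), l1 + l2))
    return lengths


def f(lengths, p, s, laplace):
    """f(N, s) = A(s) * sum_i p(i) * exp(s * n(i)), as in (11)."""
    return laplace(s) * sum(pi * math.exp(s * n) for pi, n in zip(p, lengths))


def largest_s(lengths, p, laplace, iters=100):
    """Step 5: largest s with f(N, s) <= 1, located by bisection."""
    lo, hi = 0.0, 1.0
    while f(lengths, p, hi, laplace) <= 1.0:  # grow until the bound is violated
        lo, hi = hi, 2.0 * hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(lengths, p, mid, laplace) <= 1.0:
            lo = mid
        else:
            hi = mid
    return lo


def best_code(p, laplace, s0=1.0, tol=1e-9, max_rounds=100):
    """Steps 1-6 for a finite alphabet; stops once s no longer changes."""
    s_prev = s0
    for _ in range(max_rounds):
        lengths = exp_huffman_lengths(p, math.exp(s_prev))  # step 4
        s = largest_s(lengths, p, laplace)                   # step 5
        if abs(s - s_prev) <= tol:                           # step 6
            break
        s_prev = s
    return lengths, s


p = [0.4, 0.3, 0.2, 0.1]
laplace = lambda s: 0.25 / (0.25 + s)  # assumed: exponential T with mean 4
print(best_code(p, laplace))
```

In this small example the procedure settles on the fixed-length code, whose s* exceeds that of the ordinary Huffman code with lengths {1, 2, 3, 3}.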

We can use the above methods in order to accomplish step
4, but we still need to examine how to modify steps 1 and 5
for an infinite input alphabet.

First note that, unlike in the finite case, s* < ∞, that is,
there always exists an $s^* \in \mathbb{R}_+$ such that, for all s > s*,
f(N, s) > 1. For any stable system, the buffer cannot receive
integers more quickly than it can transmit bits, so P[T > 1] is
positive. Thus the Laplace-Stieltjes
transform A(s) exceeds $c_1 e^{-s}$ for some constant $c_1 > 0$.
Also, without loss of generality, we can assume that p(i) is
monotonic nonincreasing and an optimal n(i) is monotonic
nondecreasing. This monotonicity means that $n(i) > \log_2 i$,
and there is no exponential base $a_0$ and offset constant $c_2$
for which $\sum_{i=0}^{\infty} p(i)\, e^{s\, n(i)} \leq a_0^s + c_2$ for all $s \in \mathbb{R}_+$. Thus
the summation in (11) must increase superexponentially, and,
multiplying the A(s) and summation terms, there is an s* such
that f(N, s) > 1 for s > s*.

For step 1, the initial guess proposed in [11] is an upper
bound for all possible values of s*. The Rényi entropy of P
is used to find an initial guess using

$$A(s) \left( \sum_{i=0}^{\infty} p(i)^{\frac{1}{1+\log_2 e^s}} \right)^{1+\log_2 e^s} \leq A(s) \sum_{i=0}^{\infty} p(i)\, e^{s\, n(i)} \qquad (12)$$

and choosing $s_0$ as the largest s such that the left term of
(12) is no greater than one. Thus, $s_0 \geq s^*$ for any value of s*
corresponding to step 5.
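Inequality (12) can be illustrated for a small finite distribution. In the sketch below (hypothetical helper names; the common factor A(s) is omitted because it appears on both sides) the code-independent Rényi term is compared against the code-dependent sum for two prefix codes.

```python
import math


def renyi_side(p, s):
    """Code-independent factor on the left of (12), without A(s)."""
    alpha = 1.0 / (1.0 + s * math.log2(math.e))  # alpha = 1 / (1 + log2 e^s)
    return sum(pi ** alpha for pi in p) ** (1.0 / alpha)


def code_side(p, lengths, s):
    """Code-dependent sum on the right of (12), without A(s)."""
    return sum(pi * math.exp(s * n) for pi, n in zip(p, lengths))


p = [0.4, 0.3, 0.2, 0.1]
for lengths in ([2, 2, 2, 2], [1, 2, 3, 3]):
    for s in (0.1, 0.5, 1.0, 2.0):
        print(lengths, s, renyi_side(p, s), code_side(p, lengths, s))
```

Because the left side never exceeds the right side, the largest s at which the left side of (12) is at most one cannot be smaller than the largest s at which f(N, s) is at most one, which is why it serves as an upper bound on s*.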

This technique is well-suited to a geometric distribution, for
which the Rényi entropy has the closed form given earlier in the paper, so

$$A(s)\,(1-\theta) \left( 1 - \theta^{\frac{1}{1+\log_2 e^s}} \right)^{-(1+\log_2 e^s)} \leq f(N, s).$$

However, a general distribution with a light tail, such as the
Poisson distribution, might have no closed form for this bound.
One solution to this is to use more relaxed lower bounds on
the sum (such as a partial sum with a fixed number
of terms), yielding looser upper bounds for s*. Another
approach would be to note that, because of the light tail, the
infinite sum can usually be quickly calculated to the precision
of the architecture used. Note, however, that no matter what
the technique, the bound must be chosen so that $s_0$ is a
real number and not infinity. Partial sums may be refined to
accomplish this.

In calculating f(N, s) for use in step 5, the geometric
distribution has a closed-form value for f, while the other
distributions must instead rely on
approximations of f. As before, this is easily done due to the
light tail of the distribution. Alternatively, a partial sum and
a geometric approximation can be used to bound f(N, s) and
thus s*, and these two bounds used to find two codes. If the
two codes are identical, the algorithm may proceed; otherwise,
we must roll back to the summation and improve the bounds
until the codes are identical.

These variations make the steps of the algorithm possible, 
but the algorithm itself must also be proven correct with the 
variations. 

Theorem 3: Given a geometric distribution or an input distribution
satisfying the conditions of Theorem 2 for $a = e^{s_0}$,
where $s_0$ is an upper bound on s*, the above Procedure for
Finding Code with Largest s* terminates with an optimal code.



Proof: The number of codes that can be generated in the
course of running the algorithm should be bounded so that
the algorithm is guaranteed to terminate. Optimality for the
algorithm then follows as for the finite case [11]. As in the
finite case, $s_{j+1} \geq s_j$ for j ≥ 1 (but not j = 0) due to step
5 [$f(N_j, s_j) \leq 1$], step 4 [$f(N_{j+1}, s_j) \leq f(N_j, s_j)$], and the
definition of $s_{j+1}$.

In the case of a geometric distribution, each $N_j$ is a Golomb
code $G_{k_j}$ for some positive integer $k_j$. Clearly, if we choose
$s_0$ as detailed above, it is the greatest value of $s_j$, being either
optimal or unachievable due to its derivation as a bound of
the problem. Since $G_{k_j}$ (with lengths $N_j$) is the optimal code
for exponential base $a = e^{s_{j-1}}$, (2) means that

$$\theta^{k_j} + \theta^{k_j+1} \leq e^{-s_{j-1}} < \theta^{k_j-1} + \theta^{k_j}, \text{ and thus}$$
$$(1+\theta)\,\theta^{k_1} \leq e^{-s_0} \leq e^{-s_{j-1}} < (1+\theta)\,\theta^{k_j-1}$$

and, since θ < 1, we have $k_j - 1 < k_1$ (or, equivalently,
$k_j \leq k_1$) for all j ≥ 1. Therefore, there are only $k_1$ possible
codes the algorithm can generate.

In the case of a distribution with a lighter tail, the minimum
r of Theorem 2 increases with each iteration after the first, and
the first $r_1$ (corresponding to $s_0$) upper bounds the remaining
$r_j$. Thus all candidate codes can be specified by their first
$r_1$ codeword lengths, none of which is greater than $r_1$. The
number of codes is then bounded for both cases, and the
algorithm terminates with the optimal code. ■

VI. Redundancy penalties 

It is natural to ask whether the above results can be extended
to other penalties. One penalty discussed in the literature is that
of maximal pointwise redundancy [34], which is

$$R^*(N, P) \triangleq \sup_{i \in \mathcal{X}} \left[ n(i) + \log_2 p(i) \right]$$

where we use sup when we are not assured the existence of a
maximum. This can be shown to be a limit of the exponential
case, as in [23], allowing us to analyze its minimization using
the same techniques as exponential Huffman coding. This
limit can be shown by defining dth exponential redundancy
as follows:

$$R_d(N, P) \triangleq \frac{1}{d} \log_2 \sum_{i \in \mathcal{X}} p(i)\, 2^{d(n(i) + \log_2 p(i))} = \frac{1}{d} \log_2 \sum_{i \in \mathcal{X}} p(i)^{1+d}\, 2^{d\, n(i)}.$$

Thus $R^*(N, P) = \lim_{d \to \infty} R_d(N, P)$, and the above methods
should apply in the limit. In particular:

Theorem 4: The Golomb code $G_k$ for $k = \lceil -1/\log_2 \theta \rceil$ is
optimal for minimizing maximal pointwise redundancy for $P_\theta$.

Proof:

Case 1: Consider first when $-1/\log_2 \theta$ is not an integer. We
show that $k = \lceil -1/\log_2 \theta \rceil$ is optimal by finding a D such
that, for all d > D, the optimal code for the dth exponential
redundancy penalty is $G_k$. For a fixed d, (2) implies that such
a code should satisfy

$$\left(\theta^{1+d}\right)^{k} + \left(\theta^{1+d}\right)^{k+1} \leq \frac{1}{2^d} < \left(\theta^{1+d}\right)^{k-1} + \left(\theta^{1+d}\right)^{k} \qquad (13)$$

Fig. 6. Maximal pointwise redundancy of the optimal maximal redundancy code for the geometric distribution, solid (with discontinuities represented by
dashed); optimal dth exponential redundancy for the geometric distribution, dotted for d = {1, 2, 4, 16, 256, 65536}, from lowest to highest.
Panels: (a) θ ∈ (0.5, 1); (b) θ ∈ ($2^{-0.1}$, $2^{-0.001}$), with x-axis ∝ $\log_2(-1/\log_2 \theta)$.



and thus we wish to show that this holds for all d > D.
Consider $k = \lceil -1/\log_2 \theta \rceil$. Clearly, $k > -1/\log_2 \theta$, or,
equivalently,

$$\theta^k < \frac{1}{2}. \qquad (14)$$

Now consider

$$D = \frac{1}{1 + (k-1)\log_2 \theta} - 1.$$

Since $k - 1 < -1/\log_2 \theta$, we have $(k-1)\log_2 \theta \in (-1, 0]$ and
therefore D ≥ 0. Taken
together with the fact that θ ∈ (0, 1), (14) yields $\theta^{dk} < 2^{-d}$
and $(1 + \theta^{1+d})\,\theta^k < 2\theta^k < 1$. Multiplication yields the left-hand
side of (13) for any d > D. For any such d, algebra
easily shows that we also have the inequality $(2\theta^{k-1})^{1+d} > 2$,
yielding

$$\begin{aligned}
2^d \left( \theta^{(1+d)(k-1)} + \theta^{(1+d)k} \right)
  &= \tfrac{1}{2} \left( 2\theta^{k} \right)^{1+d} \left( \theta^{-1-d} + 1 \right) \\
  &= \tfrac{1}{2} \left( 2\theta^{k-1} \right)^{1+d} \left( 1 + \theta^{1+d} \right) \\
  &> 1.
\end{aligned}$$

This is equivalent to the right-hand side of inequality (13) for
the values implied by the definition of $R_d(N, P)$. Then $G_k$
is an optimal code for d > D, and thus for the limit case of
maximal pointwise redundancy.
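The limiting relation between dth exponential redundancy and maximal pointwise redundancy used in Case 1 can be observed numerically on a small finite example. The sketch below uses hypothetical helper names and an arbitrary four-symbol distribution chosen only for illustration.

```python
import math


def max_pointwise_redundancy(p, lengths):
    return max(n + math.log2(pi) for pi, n in zip(p, lengths))


def exp_redundancy(p, lengths, d):
    """R_d(N, P) = (1/d) * log2 sum_i p(i) * 2**(d * (n(i) + log2 p(i)))."""
    total = sum(pi * 2.0 ** (d * (n + math.log2(pi)))
                for pi, n in zip(p, lengths))
    return math.log2(total) / d


p = [0.4, 0.3, 0.2, 0.1]
lengths = [1, 2, 3, 3]
for d in (1, 4, 16, 64, 256):
    print(d, exp_redundancy(p, lengths, d))
print("limit", max_pointwise_redundancy(p, lengths))
```

The printed values increase with d and approach the maximal pointwise redundancy from below.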

Case 2: Now consider when $-1/\log_2 \theta$ is an integer. It
should be noted that, for the traditional (linear) penalty, these
are precisely the k values that Golomb considered in his
original paper [3] and that they are local infima for the
minimum maximal pointwise redundancy function in θ, as in
Fig. 6. Here we show they are local minima.

Since θ = 0.5 is a dyadic probability distribution and thus
trivial, we can assume that θ > 0.5. We wish to show that
optimality is preserved in these right limits of Case 1. Note
that, for each i with fixed N,

$$\lim_{\theta' \uparrow \theta} \left[ n(i) + \log_2 p_{\theta'}(i) \right] = n(i) + \log_2 p_\theta(i).$$

This is of particular interest for the value of i maximizing
pointwise redundancy for $G_k$ at θ', where $\theta' \in (\theta^{1/\log_2 2\theta}, \theta)$,
allowing us to use the right limit of θ. Let $i^{**} = 2^{\lceil \log_2 k \rceil} - k$,
the smallest i which has codeword length exceeding the
codeword length for item 0. Clearly the pointwise redundancy
for this value is greater than that for all items with i < i**,
since they are one bit shorter but not more than twice as
likely. Similarly, items in (i**, k) have identical length but
lower probability, and thus smaller redundancy. For items with
i ≥ k, note that the redundancy of items in the sequence
{j, j + k, j + 2k, ...} for any j must be nonincreasing because the
difference in redundancy is constant yet redundancy is upper-bounded
by the maximum. Thus i** maximizes pointwise
redundancy for $G_k$ at θ'.

We know the pointwise redundancy of i** for $G_k$ at θ,
although we have yet to show that i** yields the maximal
pointwise redundancy for $G_k$ at θ or that $G_k$ minimizes maximal
pointwise redundancy. However, for any code, including
the optimal code, as a result of pointwise continuity,

$$\sup_i \left[ n(i) + \log_2 p_\theta(i) \right] \geq n(i^{**}) + \log_2 p_\theta(i^{**}) = \lim_{\theta' \uparrow \theta} \left[ n(i^{**}) + \log_2 p_{\theta'}(i^{**}) \right].$$

From the above discussion, it is clear that the right-hand
side is minimized by the Golomb code with $k = -1/\log_2 \theta$,
so, because the left-hand side achieves the same value with this
code, the left-hand side is indeed minimized by $G_k$. Thus
this code minimizes maximal pointwise redundancy for θ. The
corresponding maximal pointwise redundancy is

$$\begin{aligned}
\max_i \left[ n_\theta^{**}(i) + \log_2 p_\theta(i) \right]
  &= n_\theta^{**}\!\left(2^{\lceil \log_2 k \rceil} - k\right) + \log_2 p_\theta\!\left(2^{\lceil \log_2 k \rceil} - k\right) \\
  &= \lceil \log_2 k \rceil + 1 + \log_2(1 - \theta) + \left(2^{\lceil \log_2 k \rceil} - k\right) \log_2 \theta
\end{aligned}$$

where $N_\theta^{**} = \{n_\theta^{**}(i)\}$ is defined as the lengths of a code
minimizing maximal pointwise redundancy. Note that this is
the redundancy for all items $i = 2^{\lceil \log_2 k \rceil} + jk$ with integer
j ≥ −1. ■

It is worthwhile to observe the behavior of maximal pointwise
redundancy in a fixed (not necessarily optimal) Golomb
code with length distribution $N_k$. The maximal pointwise
redundancy

$$R^*(N_k, P_\theta) = \sup_i \left[ n_k(i) + \log_2 p_\theta(i) \right]$$

decreases with increasing θ (and $G_k$ is an optimal code for
$\theta \in (2^{-1/(k-1)}, 2^{-1/k}]$) until θ exceeds $2^{-1/k}$, after which there
is no maximum, that is, pointwise redundancy is unbounded.
This explains the discontinuous behavior of minimum maximal
redundancy for an optimal code as a function of θ, illustrated
in Fig. 6, where each continuous segment corresponds to an
optimal code for $\theta \in (2^{-1/(k-1)}, 2^{-1/k}]$.

Note also the oscillating behavior as θ ↑ 1. We show in
Appendix I that $\liminf_{\theta \uparrow 1} R^*(N_\theta^{**}, P_\theta) = 1 - \log_2 \log_2 e$ and
$\limsup_{\theta \uparrow 1} R^*(N_\theta^{**}, P_\theta) = 2 - \log_2 e$, and we characterize
this oscillating behavior. This technique is extensible to other
redundancy scenarios of the kind introduced in [23].
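The closed-form maximal pointwise redundancy of the Golomb code in Theorem 4 can be compared with a brute-force maximum over codeword indices. The sketch below is illustrative: the helper names are hypothetical, the Golomb codeword length is computed from the unary part plus the complete binary code b(j mod k, k), and the maximum is taken over a finite range of items, which suffices because redundancy is nonincreasing from one block of k items to the next.

```python
import math


def golomb_length(i, k):
    """Length of the Golomb codeword for item i with parameter k."""
    ceil_log = (k - 1).bit_length()      # ceil(log2 k)
    short = (1 << ceil_log) - k          # items coded with floor(log2 k) bits
    tail = ceil_log - 1 if (i % k) < short else ceil_log
    return i // k + 1 + tail             # unary part, then complete binary code


def golomb_k(theta):
    """k = ceil(-1 / log2(theta)), the parameter in Theorem 4."""
    return math.ceil(-1.0 / math.log2(theta))


def max_redundancy_numeric(theta, k, n_items=5000):
    return max(golomb_length(i, k) + math.log2(1 - theta) + i * math.log2(theta)
               for i in range(n_items))


def max_redundancy_closed_form(theta, k):
    ceil_log = (k - 1).bit_length()
    return (ceil_log + 1 + math.log2(1 - theta)
            + ((1 << ceil_log) - k) * math.log2(theta))


for theta in (0.7, 0.9, 0.95):
    k = golomb_k(theta)
    print(theta, k, max_redundancy_numeric(theta, k),
          max_redundancy_closed_form(theta, k))
```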

For distributions with light tails, one can use a technique
much like the technique of Theorem 2 in Section IV. First note
that this requires, as a necessary step, the ability to construct 
a minimum maximal pointwise redundancy code for finite 
alphabets. This can be done either with the method in [34] or 
any of those in [23], the simplest of which uses a variant of 
the tree-height problem [19], solved via a different extension 
of Huffman coding. Simply put, the weight combining rule, 
rather than w(j) + w(k) or a ■ (w(j) + w(k)), is 

w(j) = 2 • max(w(j), w(k)). (15) 

This rule is used to create an optimal code with lengths
$N^{(r)}$ for $W^{(r)} \triangleq \{p(0), p(1), \ldots, p(r), 2p(r+1)\}$, assuming
a unary subtree for items with index i > r (and no other
items) is part of an optimal code tree. As in the coding
method corresponding to Theorem 2, the codewords for items
0 through r of this reduced code are identical to those of
the infinite alphabet. Each other item i > r has a codeword
consisting of the reduced codeword for r + 1 followed by the
unary code for i − r − 1, that is, i − r − 1 ones followed by
a zero.

A sufficient condition for using this method is finding an r
such that

for all i < r, p(i) ≥ p(r)

and

for all j > r, p(j) ≥ 2p(j + 1).

For such j, pointwise redundancy is nonincreasing along a
unary subtree, as

$$n(j) + \log_2 p(j) = n(j+1) + \log_2 (p(j)/2) \geq n(j+1) + \log_2 p(j+1).$$

The aforementioned coding method works because, for each
j, an optimal subtree consisting of the items with index j
and higher has n(i) = n(j) − j + i; this subtree is optimal
because the weight of the root node of any subtree cannot be
less than 2p(j). A formal proof, similar to that of Theorem 2,
is omitted in the interest of space.



For a Poisson random variable, r = ⌈eλ⌉ − 1 satisfies this
condition, since, for i < r < j, p(i) ≥ p(r) (as in [31]), and

$$p(j) = \frac{j+1}{\lambda}\, p(j+1) > e\, p(j+1) > 2p(j+1).$$

Thus such a random variable can be coded in this manner.
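As an illustration of this method, the sketch below applies combining rule (15) to the reduced weights {p(0), ..., p(r), 2p(r+1)} of a Poisson distribution with λ = 1 and then extends the result with a unary subtree. The function name `max_rule_lengths` is hypothetical, and only the first 42 items are examined when evaluating the worst pointwise redundancy.

```python
import heapq
import itertools
import math


def max_rule_lengths(weights):
    """Huffman-like merging with rule (15): new weight 2 * max(w_j, w_k)."""
    tie = itertools.count()  # unique keys so the heap never compares lists
    heap = [(w, next(tie), [i]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    lengths = [0] * len(weights)
    while len(heap) > 1:
        w1, _, l1 = heapq.heappop(heap)
        w2, _, l2 = heapq.heappop(heap)
        for leaf in l1 + l2:
            lengths[leaf] += 1
        heapq.heappush(heap, (2.0 * max(w1, w2), next(tie), l1 + l2))
    return lengths


lam = 1.0
r = math.ceil(math.e * lam) - 1
p = [math.exp(-lam) * lam ** i / math.factorial(i) for i in range(r + 40)]
reduced = max_rule_lengths(p[: r + 1] + [2.0 * p[r + 1]])
# Unary-ended extension: item i > r appends i - r - 1 ones and a zero.
lengths = [reduced[i] if i <= r else reduced[r + 1] + (i - r)
           for i in range(len(p))]
worst = max(n + math.log2(pi) for pi, n in zip(p, lengths))
print(lengths[:6], worst)
```

For this instance the worst pointwise redundancy comes out as 2 − log₂ e, attained by the two most likely items; ties among equal weights may change individual lengths but not this maximum.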

Note that other sufficient conditions can be obtained through
alternative methods. One simple rule is that, for any distribution
for which $p(i) \leq 2^{-i} p(0)$ for all i > 0, $n(0) + \log_2 p(0)$ is
necessarily minimized by letting n(0) = 1,
and this will be the maximum redundancy if n(i) = i + 1
in general. For example, a unary tree optimizes P =
{0.6, 0.15, 0.15, 0.0375, 0.0375, ...}, since $\log_2 1.2 \approx 0.263$ is
a lower bound on maximal pointwise redundancy for any code
given p(0) = 0.6, and this bound is achieved for the unary
code. If viewed as a rule for a unary subtree, this is looser than
the above, since, unlike linear and exponential penalties, not
all subtrees of the subtree need be optimal. Other relaxations
can be obtained, although, as they are usually not needed, we
do not discuss them here.
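The unary-code example can be verified directly. The sketch below assumes that the distribution continues the visible pattern, with p(0) = 0.6 followed by pairs of equal probabilities, each pair one quarter of the previous one; that continuation is an assumption, since the text shows only the first five values.

```python
import math


def p(i):
    """p(0) = 0.6, then pairs 0.6 / 4**m for m = 1, 2, ... (assumed pattern)."""
    return 0.6 if i == 0 else 0.6 / 4 ** ((i + 1) // 2)


# Unary code: item i has length i + 1.
redundancies = [(i + 1) + math.log2(p(i)) for i in range(60)]
print(max(redundancies), math.log2(1.2))  # the two values agree
```

Every even-indexed item attains the maximum, which matches the lower bound 1 + log₂ 0.6 = log₂ 1.2 imposed by item 0 alone.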

VII. Conclusion 

The aforementioned methods for coding integers are ap- 
plicable to geometric and light-tailed distributions with ex- 
ponential and related penalties. Although they are not direct 
applications of Huffman coding, per se, these methods are 
derived from the properties of generalizations of the Huffman 
algorithm. This allows examination of subtrees of a proposed 
optimal code independently of the rest of the code tree, and 
thus specification of finite codes which in some sense converge 
to the optimal integer code. Different penalties (e.g., φ(x) =
x², implying the minimization of $\sum_{i \in \mathcal{X}} p(i)\, n(i)^2$) do not
share this independence property, as an optimal code tree with
optimal subtrees need not exist. Thus finding an optimal code
for such penalties is more difficult. There should, however, be
cases in which this is possible for convex φ which grow more
slowly than some exponential.

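Another extension of this work would be to find coding
algorithms for other probability mass functions under the nonlinear
penalties already considered, e.g., to attempt to use the
techniques of [33, pp. 103-105] for a more reliable algorithm.
Other possible extensions and generalizations involve variants
of geometric probability distributions; in addition to the one
we mentioned that is analogous to Proposition (2) in [27], there
are others in [28]-[30]. Extending these methods to nonbinary
codes should also be feasible, following the approaches in
[27] and [32]. Finally, as a nonalgorithmic result, it might be
worthwhile to characterize all optimal codes (not merely
finding an optimal code) as in [26, p. 289].

Appendix I

Optimal Maximal Redundancy Golomb Codes for
Large θ

Let us calculate optimal maximal redundancy as a function
of θ > 0.5:

$$\begin{aligned}
R^*(N_\theta^{**}, P_\theta)
  &= \max_i \left[ n_\theta^{**}(i) + \log_2 p_\theta(i) \right] \\
  &= 1 + \left\lceil \log_2 \left\lceil -\tfrac{1}{\log_2 \theta} \right\rceil \right\rceil + \log_2(1-\theta)
     + \left( 2^{\left\lceil \log_2 \left\lceil -\frac{1}{\log_2 \theta} \right\rceil \right\rceil} - \left\lceil -\tfrac{1}{\log_2 \theta} \right\rceil \right) \log_2 \theta \\
  &= 1 + \left\lceil \log_2 \left( -\tfrac{1}{\log_2 \theta} \right) \right\rceil + \log_2(1-\theta)
     - 2^{\left\lceil \log_2 \left( -\frac{1}{\log_2 \theta} \right) \right\rceil - \log_2 \left( -\frac{1}{\log_2 \theta} \right)}
     - \left\lceil -\tfrac{1}{\log_2 \theta} \right\rceil \log_2 \theta \\
  &= 2 + \log_2(1-\theta) + \log_2 \left( -\tfrac{1}{\log_2 \theta} \right)
     - \left\langle \log_2 \left( -\tfrac{1}{\log_2 \theta} \right) \right\rangle
     - 2^{1 - \left\langle \log_2 \left( -\frac{1}{\log_2 \theta} \right) \right\rangle}
     - \left\lceil -\tfrac{1}{\log_2 \theta} \right\rceil \log_2 \theta
\end{aligned}$$

where ⟨x⟩ denotes the fractional part of x, i.e., ⟨x⟩ = x − ⌊x⌋,
since

$$\left\lceil \log_2 \left\lceil -\tfrac{1}{\log_2 \theta} \right\rceil \right\rceil = \left\lceil \log_2 \left( -\tfrac{1}{\log_2 \theta} \right) \right\rceil$$

for θ > 0.25 (and thus for θ > 0.5). Using the Taylor series
expansion about θ = 1, we find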

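$$\log_2(1-\theta) + \log_2 \left( -\tfrac{1}{\log_2 \theta} \right) = -\log_2 \log_2 e - \left( \log_2 \sqrt{e} \right)(1-\theta) + O\!\left((1-\theta)^2\right)$$

where e is the base of the natural logarithm. Additionally,

$$-\left\lceil -\tfrac{1}{\log_2 \theta} \right\rceil \log_2 \theta = 1 + O(1-\theta).$$

Note that this actually oscillates between 1 and $1 + (1-\theta)\log_2 e$
in the limit, so this first-order asymptotic term cannot be
improved upon. However, the remaining terms

$$2 - \left\langle \log_2 \left( -\tfrac{1}{\log_2 \theta} \right) \right\rangle - 2^{1 - \left\langle \log_2 \left( -\frac{1}{\log_2 \theta} \right) \right\rangle}$$

oscillate in the zero-order term. Assigning
$x = \langle \log_2(-1/\log_2 \theta) \rangle$, we find that this expression achieves its minimum
value, 0, at 0 and 1. The maximum point is easily
found via a first derivative test. This point is achieved at
$x = 1 - \log_2 \log_2 e$, at which point the expression achieves the maximum
value $1 - \log_2 e + \log_2 \log_2 e$. Thus, gathering all terms,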



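$$\liminf_{\theta \uparrow 1} R^*(N_\theta^{**}, P_\theta) = 1 - \log_2 \log_2 e = 0.4712336270\ldots$$

$$\limsup_{\theta \uparrow 1} R^*(N_\theta^{**}, P_\theta) = 2 - \log_2 e = 0.5573049591\ldots$$

and, overall,

$$R^*(N_\theta^{**}, P_\theta) = 3 - \log_2 \log_2 e - \left\langle \log_2 \left( -\tfrac{1}{\log_2 \theta} \right) \right\rangle - 2^{1 - \left\langle \log_2 \left( -\frac{1}{\log_2 \theta} \right) \right\rangle} + O(1-\theta).$$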



This oscillating behavior is similar to that of the average 
redundancy of a complete tree, as in [35] and [36, p. 192]. 



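Contrast this with the periodicity of the minimum average
redundancy for a Golomb code [37]:

$$\bar{R}(N_{\theta,1}^{*}, P_\theta) = 1 - \log_2 \log_2 e - \log_2 e + 2^{2 - 2^{1 - \left\langle \log_2 \left( -\frac{1}{\log_2 \theta} \right) \right\rangle}} - \left\langle \log_2 \left( -\tfrac{1}{\log_2 \theta} \right) \right\rangle + O(1-\theta)$$

where $N_{\theta,1}^{*}$ is the optimal code for the traditional (linear)
penalty.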
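The asymptotic behavior derived in Appendix I can be sanity-checked numerically. The sketch below (hypothetical helper names) evaluates the closed-form maximal pointwise redundancy of the optimal Golomb code for θ close to 1 and compares it with the zero-order expression; the difference should shrink in proportion to 1 − θ, and the values should stay between the stated lim inf and lim sup up to that error.

```python
import math

LOG2E = math.log2(math.e)
LIM_INF = 1 - math.log2(LOG2E)   # about 0.4712
LIM_SUP = 2 - LOG2E              # about 0.5573


def exact_redundancy(theta):
    """Closed-form maximal redundancy of G_k with k = ceil(-1 / log2 theta)."""
    k = math.ceil(-1.0 / math.log2(theta))
    ceil_log = (k - 1).bit_length()  # ceil(log2 k)
    return (ceil_log + 1 + math.log2(1 - theta)
            + ((1 << ceil_log) - k) * math.log2(theta))


def asymptotic_redundancy(theta):
    """3 - log2 log2 e - <x> - 2**(1 - <x>), where <x> is the fractional
    part of x = log2(-1 / log2 theta); the O(1 - theta) term is dropped."""
    x = math.log2(-1.0 / math.log2(theta))
    frac = x - math.floor(x)
    return 3 - math.log2(LOG2E) - frac - 2.0 ** (1 - frac)


for theta in (0.99, 0.995, 0.999, 0.9995, 0.9999):
    print(theta, exact_redundancy(theta), asymptotic_redundancy(theta))
```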





Appendix II

Glossary of Terms

Notation : Meaning

$a$ : Base of exponential penalty
$b(x, k)$ : (x + 1)th codeword of complete binary code with k items (i.e., the order-preserving [alphabetic] code having the first $2^{\lceil \log_2 k \rceil} - k$ items with length $\lfloor \log_2 k \rfloor$ and the last $2k - 2^{\lceil \log_2 k \rceil}$ items with length $\lceil \log_2 k \rceil$)
$c(i)$ : Codeword (for symbol) i
$C$ : Code {c(i)}
$e$ : Base of the natural logarithm (e ≈ 2.71828)
$G_k$ : Golomb code with parameter k, one of the form $\{1^{\lfloor j/k \rfloor} 0\, b(j \bmod k, k) : j \geq 0\}$
$H_\alpha(P)$ : Rényi entropy $(1-\alpha)^{-1} \log_2 \sum_{i \in \mathcal{X}} p(i)^\alpha$ (or, if α ∈ {0, 1, ∞}, the limit of this)
$i^{**}$ : Index of the codeword that, among a given code's inputs $i \in \mathcal{X}$, maximizes pointwise redundancy, $n(i) + \log_2 p(i)$
$j \bmod k$ : $j - k \lfloor j/k \rfloor$
$L_a(P, N)$ : Penalty $\log_a \sum_{i \in \mathcal{X}} p(i)\, a^{n(i)}$
$n(i)$ : Length of codeword (for symbol) i
$N$ : {n(i)}, the lengths for a given code
$n^{(r)}(i)$ : Length of codeword i of an optimal code minimizing maximum redundancy for $W^{(r)}$
$N^{(r)}$ : $\{n^{(r)}(i)\}$, the lengths of an optimal code minimizing maximum redundancy for $W^{(r)}$
$n^*(i)$ ($n^*_{\theta,a}(i)$) : Length of codeword i of an optimal code for an exponential penalty, $L_a$ (...if θ and a are specified)
$N^*$ ($N^*_{\theta,a}$) : $\{n^*(i)\}$, the lengths of an optimal code (...if θ and a are specified)
$n^*_{\theta,d}(i)$ : Length of codeword i of an optimal code minimizing dth exponential redundancy
$N^*_{\theta,d}$ : $\{n^*_{\theta,d}(i)\}$, the lengths of an optimal code minimizing dth exponential redundancy
$n^{**}(i)$ : Length of codeword i of an optimal code minimizing maximum redundancy
$N^{**}$ : $\{n^{**}(i)\}$, the lengths of an optimal code minimizing maximum redundancy
$O(\cdot)$ : Order of · asymptotic complexity
$p(i)$ : Probability of input symbol i
($p_\theta(i)$) : (...for geometric distribution with parameter θ)
($p_\lambda(i)$) : (...for Poisson distribution with parameter λ)
$P$ : {p(i)}, the input probability mass function
($P_\theta$) : (...for geometric distribution with parameter θ)
($P_\lambda$) : (...for Poisson distribution with parameter λ)
$\bar{R}_a(N, P)$ : $L_a(P, N) - H_{\alpha(a)}(P)$, the average pointwise redundancy
$R_d(N, P)$ : $\frac{1}{d} \log_2 \sum_{i \in \mathcal{X}} p(i)^{1+d}\, 2^{d\, n(i)}$, the dth exponential redundancy
$R^*(N, P)$ : $\sup_{i \in \mathcal{X}} [n(i) + \log_2 p(i)]$, the maximum pointwise redundancy
$\mathbb{R}$ : The set of real numbers
$\mathbb{R}_+$ : The set of positive real numbers
$s_0$ : Upper bound on s*
$s^*$ : ln a for a corresponding to optimal coding for buffer overflow
$w(i)$ : Weight (for symbol) i
$W$ : {w(i)}, the set of weights
$w^{(r)}(i)$ : p(i) for i ≤ r, 2p(r + 1) for i = r + 1
$W^{(r)}$ : $\{w^{(r)}(i)\}$ = {p(0), p(1), ..., p(r), 2p(r + 1)}
$\mathcal{X}$ : Input alphabet (usually $\mathcal{X}_\infty$ = {0, 1, ...})
$\alpha(a)$ : $1/(1 + \log_2 a)$ (parameter for Rényi entropy)
$\theta$ : Geometric distribution parameter ($p_\theta(i) = (1-\theta)\theta^i$)
$\lambda$ : Poisson distribution parameter ($p_\lambda(i) = \lambda^i e^{-\lambda}/i!$)
$\Phi$ : Golden ratio, $\frac{1}{2}(1 + \sqrt{5})$